本次更新有几点:
相关字段确定及公式更新
dlvy_date出库时间、biz_model销售模型(分为自营和MP),分别对应不同的公式;
GMV÷(1+销售税率)-移动平均价(去税) *商品数量 ; MP:GMV*0.03÷(1+销售税率)+处方费÷1.06-(处方费÷3)*0.3;GMV÷(1+销售税率)-移动平均价(去税) *商品数量 ; MP:GMV*0.05÷(1+销售税率)+处方费÷1.06-(处方费÷3)*0.3;dlvy_date出库时间、biz_model销售模型(分为自营和MP),分别对应不同的公式
GMV÷(1+销售税率) ; MP:GMV*0.03÷(1+销售税率)+处方费÷1.06;GMV÷(1+销售税率) ; MP:GMV*0.05÷(1+销售税率)+处方费÷1.06;ordr_time, 支付口径:pay_date, 出库口径:dlvy_date。why user_id is missing in some orders?
后台下单(手工单)时,没有user_id,药师后台代下单, 三方运营重下三方订单(天猫、京东等)。
why ordr_detl_id and ordr_id are not unique?
pm_list_id是唯一的,购买数量比较多的药品时,ordr_detl_id可能会被拆成几个pm_list_id。下面会有pm_list_id的唯一性分析。
Why some orders can be found in the order table, but some others cannot?
已提供部分订单号给公司验证
many variables have negative or missing values, such as pm_num (order quantity), pm_use_coupn_amt, revenue, and gp_amt
部分字段出现负值完全是因为退货订单(rma_yn=1),以下有相关分析。
Why an order not cancelled can be invalid?
已提供部分订单号给公司验证
部分字段的真实取值跟dictionary对不上
已申请新的字段解释,正在审批中
部分字段废弃
revenue_freight、gp_amt_product 这两个字段废弃。
1)如果一个人买了又退,是不是有两条不同order_id的订单,一条rma_yn为0,一条为1?
同一个人的退货订单是向表里增加记录,会导致重复,更新pt_d时间,但是退货订单的商品数量、金额可能与原订单不同,顾客可能只退一部分货。
dictionary更新
向dictionary中增加了2个表 - 111标准指标库.csv:使用sheet:1药网,该表给出了各种需要计算的字段的公式,以此为准。 - 所有表字段值解释.csv : 该表给出了多个数据库表的字段值解释,比如B2C订单中订单状态A2N,A2Z,A2A分别表示什么含义
Add a column Nminus
summaryData = function(dat){
writeLines(paste("Number of obs:", nrow(dat)))
names = colnames(dat)
flags = matrix(NA,length(names),12)
colnames(flags) = c("NAs","Zeros","NZs",'Mean','SD',"Min",'Median',"Max","Unique","class", 'Size', 'Nminus')
rownames(flags) = names
for(name in names){
ix = dat[, !is.na(get(name))]
val = dat[ix, get(name)]
if(all(unique(val) %in% c('True','False'))) val = val=='True'
if(class(val[1])[1] == 'logical') val = as.numeric(val)
flags[name,1:3] = c(sum(!ix),sum(val==0),sum(val!=0))
if(class(val[1])[1] %in% c("numeric","integer")){
flags[name,4:8] = round(c(mean(val),sd(val),min(val),median(val),max(val)),digits=2)
flags[name, 12] = sum(val<0)
}
flags[name, 9] = length(unique(val))
flags[name, 10] = class(val[1])[1]
flags[name, 11] = object.size(val) / (1024^2) # MB
}
rm(val)
return(flags)
}orders = readRDS('E:/University of Connecticut/Peng, Jing - 111/data/cleaned/orders.RDS')
orders = orders %>% mutate_if(is.numeric , replace_na, replace = 0)
orders$dlvy_date = as.POSIXct(orders$dlvy_date, format="%Y-%m-%d", tz='UTC')
orders$ordr_time = as.POSIXct(orders$ordr_time, format="%Y-%m-%d", tz='UTC')
# ZZ = summaryData(orders)
ZZ = fread('E:/University of Connecticut/Peng, Jing - 111/data/Guan/summaryData_orders.csv')
datatable(ZZ, filter = 'top', options = list(pageLength = 15), caption = 'Table 1: The summary of B2C orders.')dlvy_date出库时间、biz_model销售模型(分为自营和MP),分别对应不同的公式
GMV÷(1+销售税率) ; MP:GMV*0.03÷(1+销售税率)+处方费÷1.06;GMV÷(1+销售税率) ; MP:GMV*0.05÷(1+销售税率)+处方费÷1.06;# dlvy_date NA值共有1917235行
orders[biz_model != 1, cal_revenue := (abs(gmv)/(1+abs(taxrate)))] #自营
orders[dlvy_date <= as.POSIXct('2020-1-1', format="%Y-%m-%d", tz='UTC') & biz_model ==1, cal_revenue := (abs(gmv)*0.03/(1+abs(taxrate))+ abs(prescription_amount)/1.06)] # MP, 2020.1.1之前
orders[dlvy_date > as.POSIXct('2020-1-1', format="%Y-%m-%d", tz='UTC') & biz_model ==1, cal_revenue := (abs(gmv)*0.05/(1+abs(taxrate))+ abs(prescription_amount)/1.06)] # MP, 2020.1.1之后
orders[is.na(dlvy_date) & biz_model ==1, cal_revenue := (abs(gmv)*0.03/(1+abs(taxrate))+ abs(prescription_amount)/1.06)] # MP, dlvy_date is NA
orders[, isEqual_R := ifelse((abs(revenue)-abs(cal_revenue))<1,1,0)]
orders[,table(isEqual_R)]isEqual_R
0 1
178365 14694304
Marketing = 市场部优惠券金额(mkt_coupn_amt)+促销满减(pm_pmtn_amt)+促销满赠(pmtn_full_amt)+使用账户余额金额(use_acct_amt)+MP商家优惠券金额(mp_coupn_amt) + 官网运营费(of_web_coupn_amt) +类目运营费(cat_coupn_amt)+人事部(hr_coupn_amt)+药师费(pamst_coupn_amt)+分摊后的激励金(intv_pay)+使用返利余额金额(use_re_sum_amt)+其他部门优惠券金额(空)
orders[, cal_marketing := (mkt_coupn_amt+pm_pmtn_amt+pmtn_full_amt+use_acct_amt+mp_coupn_amt+of_web_coupn_amt+cat_coupn_amt+hr_coupn_amt+pamst_coupn_amt+intv_pay+use_re_sum_amt)][, isEqual_M := ifelse((abs(marketing)-abs(cal_marketing))<1,1,0)]
orders[,table(isEqual_M)]isEqual_M
0 1
21043 14851626
dlvy_date出库时间、biz_model销售模型(分为自营和MP),分别对应不同的公式;
GMV÷(1+销售税率)-移动平均价(去税) *商品数量 ; MP:GMV*0.03÷(1+销售税率)+处方费÷1.06-(处方费÷3)*0.3;GMV÷(1+销售税率)-移动平均价(去税) *商品数量 ; MP:GMV*0.05÷(1+销售税率)+处方费÷1.06-(处方费÷3)*0.3;orders[biz_model != 1, cal_gp := (abs(gmv)/(1 + abs(taxrate)) - abs(move_avg_prc_notax)*abs(pm_num))]# 自营
orders[dlvy_date <= as.POSIXct('2020-1-1', format="%Y-%m-%d", tz='UTC') & biz_model ==1,cal_gp := (abs(gmv)*0.03)/(1+abs(taxrate)) + abs(prescription_amount)/1.06 - (abs(prescription_amount)/3)*0.3]# MP, 2020.1.1之前
orders[dlvy_date > as.POSIXct('2020-1-1', format="%Y-%m-%d", tz='UTC') & biz_model ==1,cal_gp := (abs(gmv)*0.05)/(1+abs(taxrate)) + abs(prescription_amount)/1.06 - (abs(prescription_amount)/3)*0.3]# MP, 2020.1.1之后
orders[is.na(dlvy_date) & biz_model ==1,cal_gp := (abs(gmv)*0.03)/(1+abs(taxrate)) + abs(prescription_amount)/1.06 - (abs(prescription_amount)/3)*0.3]# MP, dlvy_date is NA
orders[, isEqual_G := ifelse((abs(gp_amt)-abs(cal_gp))<1,1,0)]
orders[,table(isEqual_G)]isEqual_G
0 1
371492 14501177
顾客实际付款金额(gmv) :商品数量 * 销售单价 + 运费金额 + 处方费 - Marketing;
orders[, cal_gmv := abs(pm_num)*abs(pm_price) + abs(freight) + abs(prescription_amount) - abs(marketing)]
orders[, isEqual_GMV := ifelse((abs(gmv)-abs(cal_gmv))<1,1,0)]
orders[, table(isEqual_GMV)]isEqual_GMV
0 1
295614 14577055
orders[,table((abs(act_sale)-abs(gmv)) < 1)]
FALSE TRUE
946046 13926623
orders[,table((abs(act_sale)-abs(cal_gmv)) < 1)]
FALSE TRUE
1231282 13641387
tc=orders[,.(wrong_revenue=length(which(isEqual_R==0)),wrong_marketing=length(which(isEqual_M==0)),wrong_gp_amt=length(which(isEqual_G==0))),by=ordr_time]
long_tc = melt(tc,id='ordr_time')
long_tc$value = log(long_tc$value)
long_tc$ordr_time = as.Date(long_tc$ordr_time)
p = ggplot(data = long_tc,aes(x=ordr_time,y=value,group=variable, color=variable))+
geom_line()+
labs(x = "Time", y = "Wrong value Count(log)") +
scale_x_date(date_breaks="1 months",date_labels="%b /%m")+
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1))
ggplotly(p)tc=orders[,.(wrong_revenue=length(which(isEqual_R==0)),wrong_marketing=length(which(isEqual_M==0)),wrong_gp_amt=length(which(isEqual_G==0))),by=ordr_stus]
long_tc = melt(tc,id='ordr_stus')
long_tc$value = log(long_tc$value)
ggplot(data = long_tc,aes(x=ordr_stus,y=value,group=variable, color=variable))+
geom_line()+
labs(x = "ordr_stus", y = "Wrong value Count(log)")+
scale_x_discrete(breaks=c("A2K", "A2N", "A2T", "A2Z", "PNP", "PNS"),
labels=c("已拆单", "订单已取消", "用户已签收", "订单完成", "人工审核不通过", "订单完成(人工审核)"))tc=orders[,.(wrong_revenue=length(which(isEqual_R==0)),wrong_marketing=length(which(isEqual_M==0)),wrong_gp_amt=length(which(isEqual_G==0))),by=sale_type_id]
long_tc = melt(tc,id='sale_type_id')
long_tc$value = log(long_tc$value)
ggplot(data = long_tc,aes(x=sale_type_id,y=value,group=variable, color=variable))+
geom_line()+scale_x_continuous(breaks=long_tc$sale_type_id)+
labs(x = "sale_type_id", y = "Wrong value Count(log)")tc=orders[,.(wrong_revenue=length(which(isEqual_R==0)),wrong_marketing=length(which(isEqual_M==0)),wrong_gp_amt=length(which(isEqual_G==0))),by=biz_model]
long_tc = melt(tc,id='biz_model')
long_tc$value = log(long_tc$value)
ggplot(data = long_tc,aes(x=biz_model,y=value,group=variable, color=variable))+geom_line()+scale_x_continuous(breaks=long_tc$biz_model)+labs(x = "biz_model", y = "Wrong value Count(log)")We found gp_amt and pm_num are negative when ram_yn equals to 1.
tc = orders[,
.(Nminus_gp_amt = length(which(gp_amt<0)),
Nminus_pm_num = length(which(pm_num<0)),
Nminus_cost_hastax = length(which(cost_hastax<0)),
Nminus_revenue = length(which(revenue<0)),
Nminus_gmv = length(which(gmv<0)),
Nminus_act_sale = length(which(act_sale<0)),
Nminus_acc_sale = length(which(acc_sale<0))),
by =rma_yn]
long_tc = melt(tc,id='rma_yn')
long_tc$value = log(long_tc$value)
ggplot(data = long_tc,aes(x=rma_yn,y=value,fill=variable))+
geom_bar(position = 'dodge',stat = 'identity')+
labs(x = "rma_yn", y = "Minus value Count(log)")+
scale_x_continuous(breaks=long_tc$rma_yn)+
scale_fill_brewer(palette = "Set3")tc = orders[,
.(Nminus_gp_amt = length(which(gp_amt<0)),
Nminus_pm_num = length(which(pm_num<0)),
Nminus_cost_hastax = length(which(cost_hastax<0)),
Nminus_revenue = length(which(revenue<0)),
Nminus_gmv = length(which(gmv<0)),
Nminus_act_sale = length(which(act_sale<0)),
Nminus_acc_sale = length(which(acc_sale<0))),
by =ordr_time]
long_tc = melt(tc,id='ordr_time')
long_tc$value = log(long_tc$value)
long_tc$ordr_time = as.Date(long_tc$ordr_time)
p = ggplot(data = long_tc,aes(x=ordr_time,y=value,group=variable, color=variable))+
geom_line()+
labs(x = "Time", y = "Minus value Count(log)")+
scale_x_date(date_breaks="1 months",date_labels="%b /%m")+
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1))
ggplotly(p)tc = orders[,
.(Nminus_gp_amt = length(which(gp_amt<0)),
Nminus_pm_num = length(which(pm_num<0)),
Nminus_cost_hastax = length(which(cost_hastax<0)),
Nminus_revenue = length(which(revenue<0)),
Nminus_gmv = length(which(gmv<0)),
Nminus_act_sale = length(which(act_sale<0)),
Nminus_acc_sale = length(which(acc_sale<0))),
by =ordr_stus]
long_tc = melt(tc,id='ordr_stus')
long_tc$value = log(long_tc$value)
ggplot(data = long_tc,aes(x=ordr_stus,y=value,group=variable, color=variable))+
geom_line()+
labs(x = "ordr_stus", y = "Minus value Count(log)")+
scale_x_discrete(breaks=c("A2K", "A2N", "A2T", "A2Z", "PNP", "PNS"),
labels=c("已拆单", "订单已取消", "用户已签收", "订单完成", "人工审核不通过", "订单完成(人工审核)"))tc = orders[,
.(Nminus_gp_amt = length(which(gp_amt<0)),
Nminus_pm_num = length(which(pm_num<0)),
Nminus_cost_hastax = length(which(cost_hastax<0)),
Nminus_revenue = length(which(revenue<0)),
Nminus_gmv = length(which(gmv<0)),
Nminus_act_sale = length(which(act_sale<0)),
Nminus_acc_sale = length(which(acc_sale<0))),
by =sale_type_id]
long_tc = melt(tc,id='sale_type_id')
long_tc$value = log(long_tc$value)
ggplot(data = long_tc,aes(x=sale_type_id,y=value,group=variable, color=variable))+
geom_line()+
scale_x_continuous(breaks=long_tc$sale_type_id)+labs(x = "sale_type_id", y = "Minus value Count(log)")tc = orders[,
.(Nminus_gp_amt = length(which(gp_amt<0)),
Nminus_pm_num = length(which(pm_num<0)),
Nminus_cost_hastax = length(which(cost_hastax<0)),
Nminus_revenue = length(which(revenue<0)),
Nminus_gmv = length(which(gmv<0)),
Nminus_act_sale = length(which(act_sale<0)),
Nminus_acc_sale = length(which(acc_sale<0))),
by =biz_model]
long_tc = melt(tc,id='biz_model')
long_tc$value = log(long_tc$value)
ggplot(data = long_tc,aes(x=biz_model,y=value,group=variable, color=variable))+
geom_line()+
labs(x = "biz_model", y = "Minus value Count(log)")+
scale_x_continuous(breaks=long_tc$biz_model, labels = long_tc$biz_model)rma_orders = orders[,.(rma_count = log(length(which(rma_yn==1)))), by=ordr_time]
rma_orders$ordr_time = as.Date(rma_orders$ordr_time)
p = ggplot(data = rma_orders,aes(x=ordr_time,y=rma_count))+
geom_line(colour='#FF4500')+
labs(x = "Time", y = "Return order Count(log)")+
scale_x_date(date_breaks="1 months",date_labels="%b /%m")+
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1))
ggplotly(p)商品数量
orders[,table(rma_yn, pm_num<0)]
rma_yn FALSE TRUE
0 14732118 0
1 0 140551
毛利额
orders[,table(rma_yn, gp_amt<0)]
rma_yn FALSE TRUE
0 13345935 1386183
1 18353 122198
含税成本
orders[,table(rma_yn, cost_hastax<0)]
rma_yn FALSE TRUE
0 14732118 0
1 4548 136003
顾客实际付款金额
orders[,table(rma_yn, gmv<0)]
rma_yn FALSE TRUE
0 14730505 1613
1 8015 132536
gmv去税
orders[,table(rma_yn, revenue<0)]
rma_yn FALSE TRUE
0 14729398 2720
1 8041 132510
优惠金额总计
orders[,table(rma_yn, marketing<0)]
rma_yn FALSE TRUE
0 14732115 3
1 119766 20785
重复出现次数从2到6,可以考虑剔除重复3次及以上的pm_list_id对应的订单
dup_pm = tidyfst::count_dt(orders,pm_list_id) %>% filter_dt(n>1)
dup_pm[,table(n)]n
2 3 4 5 6
157136 3872 29 5 8
pm_list_id相同时,很多字段也会出现不相同的情况,如gp_amt, gmv, pm_num, acc_sale, act_sale,可能是退了一部分货,举例说明
instance = orders%>%filter(pm_list_id==11166138524)
datatable(instance,extensions = 'FixedColumns', options = list(scrollX=TRUE,fixedColumns = TRUE))check_list = dup_pm[n==2,unique(pm_list_id)]
orders[pm_list_id %in% check_list,table(rma_yn)]rma_yn
0 1
184230 130042
按照gmv(用户实付金额)来计算销量
orders[, table(gmv>0)]
FALSE TRUE
882002 13990667
orders[,table(cat1_id)]cat1_id
0 103 107 108 953710 955306 955405 960464
13 5 1532958 20 10875747 594628 271672 303542
962285 964106 971591 972052 972103
681885 599758 253 3798 8390
orders[,table(cat1_name)]cat1_name
参茸花茶、休闲零食 成人用品 家电 美妆个护
5 594628 20 255655
门店专用 维生素、钙剂 药城专用 药妆个护
253 681885 8390 16017
医疗器械 医疗器械、健康电器 医药拓展 隐形眼镜
273074 1259884 71 303542
营养保健 云药房 中西药品 滋补保健
174 3727 10875747 599584
length(unique(orders$cat2_id))[1] 115
length(unique(orders$cat3_id))[1] 720
按照一级类目名称分类
trend = orders[gmv > 0, .(total_sales = log(sum(gmv))), by = .(ordr_time_cut = cut(ordr_time, breaks = "month"), cat1_name)]
trend$ordr_time_cut = as.Date(trend$ordr_time_cut)
getPalette = colorRampPalette(brewer.pal(5, "Set1"))
p = ggplot(data = trend,aes(x=ordr_time_cut,y=total_sales,group=cat1_name,color=as.factor(cat1_name)))+
geom_line()+
geom_point()+
labs(x = "Time", y = "Sales(log) by category 1")+
scale_x_date(date_breaks="1 months",date_labels="%b /%m")+
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1))+
scale_color_manual(name = "Cat1_name", values = getPalette(length(unique(trend$cat1_name))))
ggplotly(p)按照一级类目ID分类
trend = orders[gmv > 0, .(total_sales = log(sum(gmv))), by = .(ordr_time_cut = cut(ordr_time, breaks = "month"), cat1_id)]
trend$ordr_time_cut = as.Date(trend$ordr_time_cut)
trend_d = dcast(trend,ordr_time_cut~cat1_id,value.var = "total_sales")
dygraph(trend_d, main = "Sales(log) by cat1_id")%>%
dyRangeSelector(dateWindow = c("2018-12-01", "2020-08-01"))%>%
dyOptions(colors = RColorBrewer::brewer.pal(6, "Set1"))%>%
dyOptions(drawPoints = TRUE, pointSize = 2) %>%
dyEvent("2020-1-23", "COVID-19", labelLoc = "bottom")%>%
dyLegend(show = "follow")%>%
dyHighlight(highlightSeriesOpts = list(strokeWidth = 3)) %>%
dyCSS(textConnection("
.dygraph-legend > span { display: none; }
.dygraph-legend > span.highlight { display: inline; }
"))按照二级类目名称分类
trend = orders[gmv > 0, .(total_sales = log(sum(gmv))), by = .(ordr_time_cut = cut(ordr_time, breaks = "month"), cat2_name)]
trend$ordr_time_cut = as.Date(trend$ordr_time_cut)
getPalette = colorRampPalette(brewer.pal(5, "Set1"))
split_list = split(unique(trend$cat2_name),1:10)plotlist = list()
for(i in 1:length(split_list)){
p = ggplot(data = trend[cat2_name %in% split_list[[i]]],aes(x=ordr_time_cut,y=total_sales,group=cat2_name,color=as.factor(cat2_name)))+
geom_line()+
geom_point()+
labs(x = "Time", y = "Sales(log) by cat2_name")+
scale_x_date(date_breaks="1 months",date_labels="%b /%m")+
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1))+
scale_color_manual(name = "cat2_name", values = getPalette(length(split_list[[i]])))
plotlist[[i]] = print(ggplotly(p))
}
htmltools::tagList(setNames(plotlist, NULL))按照二级类目ID分类
trend = orders[gmv > 0, .(total_sales = log(sum(gmv))), by = .(ordr_time_cut = cut(ordr_time, breaks = "month"), cat2_id)]
trend$ordr_time_cut = as.Date(trend$ordr_time_cut)
trend_d = dcast(trend,ordr_time_cut~cat2_id,value.var = "total_sales")
total_num = length(unique(trend$cat2_id))
split_list = split(x=2:total_num,f=1:11)myfun = function(cut_list){
tmp = c(1,split_list[[cut_list]])
dygraph(trend_d[,..tmp], main = "Sales(log) by cat2_id",group = "My group")%>%
dyRangeSelector(dateWindow = c("2018-12-01", "2020-08-01"))%>%
dyOptions(colors = RColorBrewer::brewer.pal(6, "Set1"))%>%
dyOptions(drawPoints = TRUE, pointSize = 2) %>%
dyEvent("2020-1-23", "COVID-19", labelLoc = "bottom")%>%
dyLegend(show = "follow")%>%
dyHighlight(highlightSeriesOpts = list(strokeWidth = 3)) %>%
dyCSS(textConnection("
.dygraph-legend > span { display: none; }
.dygraph-legend > span.highlight { display: inline; }
"))
}
res = lapply(1:length(split_list), function(i) myfun(i))
htmltools::tagList(res)